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Abstract 

In computing, as in many aspects of life, changes incur cost. Many optimization problems 
are formulated as a one-time instance starting from scratch. However, a common case that arises 
is when wc already have a set of prior assignments, and must decide how to respond to a new 
set of constraints, given that each change from the current assignment comes at a price. That 
is, we would like to maximize the fitness or efficiency of our system, but we need to balance it 
with the changeout cost from the previous state. 

We provide a precise formulation for this tradeoff and analyze the resulting stable extensions 
of some fundamental problems in measurement and analytics. Our main technical contribution 
is a stable extension of PPS (probability proportional to size) weighted random sampling, with 
applications to monitoring and anomaly detection problems. We also provide a general frame- 
work that applies to top-k, minimum spanning tree, and assignment. In both cases, we are 
able to provide exact solutions, and discuss efficient incremental algorithms that can find new 
solutions as the input changes. 

1 Introduction 

Most textbook optimization problems have a clean statement: given a set of constraints and an 
optimization target, we are free to choose any feasible solution, and so can strive for the optimal 
result. However, when applying these techniques, we may find that we do not begin with a clean 
slate. Rather, we often have some (partial) assignment, perhaps resulting from a previous opti- 
mization over a prior input. This assignment is unlikely to be optimal for the current instance, but 
each modification to the current assignment can come with a cost. We therefore have to balance the 
improvement from a new assignment (the fit) with the cost of changing from the current assignment 
to the new one (the stability). 

Consider a network operator who is leasing network connections to meet customer demand. 
When there are no existing leases, this can be modeled as a standard graph optimization problem. 
But after an initial solution is found, customer demand may change. The operator then has to 
solve a more complex optimization, since there is a (fixed) cost to breaking an existing lease and 
establishing a new one. This cannot easily be represented as an instance of the original problem, 
due to the mixture of pricing schemes. 

Formalizing this setting, we target applications where the input state x ^ X changes over time. 
We maintain an output S & S that can be changed in response to changes in the input. We are 
interested in the tradeoff between the fit (or efficiency) of the output with respect to the current 
input and the changeout cost from the previous output. This requirement for 'stability' arises in a 
variety of different applications: 
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Human-interpretable output: In many settings devoted to monitoring and managing a system, there 
is a "dashboard" that presents information to an operator. For example, this could be details of 
the most congested links in a network, the heaviest users of the system, locations with anomalous 
activity, and so on. The underlying information may be rapidly changing, so that constantly 
updating the display would lead to a confusing mess of information. Instead, the operator should 
be presented with a view which is reasonably stable, so that they can absorb the information, and 
track the change in properties of items over time. At the same time, the output should be close to 
optimal in detailing the most important items. 

The overhead of change: In a client-server assignment or a routing problem, changes in the output 
may correspond to cache swaps, job migration, or modifications to a routing table. In all of these 
cases, each change incurs a cost, and very rapid change can lead to poor performance: delays, due 
to routes changing midway, or low throughput due to constant reassignment of machines. We are 
therefore willing to settle for a solution that is sub-optimal at any instant, but which changes only 
slowly. 

Audience-aware advertising: Consider a roadside electronic billboard that can respond to the set 
of road users passing by. We can choose an advertisement to display in response to demographics 
of the users, but constantly changing the ad will make it impossible for any of them to absorb. 

Allocating Resources to Monitoring: Analysis such as change or anomaly detection requires tracking 
behavior of items over time. In high-throughput systems, it is feasible to collect detailed statistics 
on only a representative subset of active items. Picking the items to monitor that are currently 
the most 'interesting' (top-A; or a random sample) may fail to keep any around long enough to 
build sufficient history for particular items, so we prefer to retain monitored items while they are 
useful. Moreover, to efficiently store historic summaries for an extended time period, we only need 
to record the modifications to the summary made over time, so limiting modifications means that 
less storage is required. 

Contributions: We formalize and motivate the algorithmic problem of balancing fitness and sta- 
bility. Our main results concern the stable extension of some fundamental problems in measurement 
and analytics. We first consider sampling with stability, and focus on PPS (probability proportional 
to size) weighted random samples. We then continue to a selection of fundamental optimization 
problems which includes matroids and introduce a unified model for the stable extension of top-A; 
set. Minimum Spanning Tree, and Assignment. An empirical study of our methods demonstrates 
their advantages over alternative approaches. Last, we point to further classic problems with nat- 
ural stable extensions. 

• Stable PPS sample: A PPS sample is a weighted random sample where the sampling prob- 
ability of each entry is proportional to its weight, but truncated to be at most 1 [11]. PPS 
sampling minimizes the sum of variances across all distributions with the same expected size 
sample. We study PPS as a method to choose a representative selection of input data (as 
needed in the motivating applications), and consider its stable extension. In this stable ex- 
tension, the output is a vector p of inclusion probabilities which sum to k and the fitness 
is the respective sum of variances. We also maintain a sample of entries sampled indepen- 
dently according to the current output distribution. We measure the changeout cost by the 
expectation of the set difference of samples between the two sample distributions. Under the 
appropriate procedure, we show that this can be kept to the minimal value, the Li difference 
between the initial and target distributions. 
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• Optimization problems with additive fitness function and fixed-size output: The simplest prob- 
lem in this class, where there are no further restrictions on the output beyond having a fixed 
size k, is to produce the top-A; set, containing the k heaviest entries. In the stable extension, 
a valid output is any subset of size k, where the fitness of a particular subset is the sum of 
values of entries included in the subset and the changeout cost is the set difference between 
outputs. The stable extension of top-k models monitoring of heavy users or maintaining the 
most effective cache. This framework also applies to assignment, i.e. minimum bipartite 
matching (where output must be a bipartite matching), and minimum spanning tree (output 
must be a spanning tree). 

We establish a relation between the stable extension of an optimization problem in this class 
and instances of the original optimization problem on a parametrically specified modified in- 
put. This relation allows us to apply an algorithm for the original optimization problem as a 
black box to obtain a stable solution. Moreover, we establish a mapping between a dynamic 
version of the stable extension (where we must efficiently modify the output in response to 
incremental updates to the input) and the dynamic version of the original optimization prob- 
lem. This mapping similarly allows off-the-shelf applications of existing dynamic algorithms 
to efficiently recompute a stable solution when the input is modified. 

2 Relation to Time decay models 

A natural first attempt at providing stability of output is to stabilize the input, which is commonly 
performed using some time-decayed average (via an exponential, polynomial, or sliding window 
decay function) over recent input states [U [6] . Decay models are successful in capturing the current 
state when inputs can be modeled as noisy samples from a slowly changing underlying distribution. 
In this case, the time-decayed (moving) average better captures the "real" current state than the 
current input. 

When using decay models directly to obtain stability of output, we can compute the best-fit 
output for the time-decayed average. This approach lacks, however, when similarity of the input 
does not guarantee similarity, or stability, of the output. 

For example, the exact location of optimal cluster centers will vary even on smoothed input, 
and so incur the fixed changeout cost. Another example is a top-k set over an input where 2k values 
are incremented uniformly and independently at random. The top-A; set has sensitivity to small 
perturbation, which smoothing the data will not eliminate. Ideally, we would like the output to be 
stable over time: the small benefit of choosing the top-ranked values does not outweigh the cost of 
output changes. Moreover, when the output function is insensitive to small perturbations in the 
input, we cannot fine tune the tradeoff of fit and stability by controlling the decay parameter: the 
stability of the output is not necessarily an increasing function of the decay parameter. Even if we 
could do so, we would have no analytic guarantees over the tradeoff. In contrast, our formulation 
shows that we can find optimal solutions for any point on the tradeoff curve. 

Therefore, as we also demonstrate empirically, while time-decay can be combined with our 
models as a first phase, (such as to replace the input by its time-moving average when the latter 
better captures the current state), it is not designed to optimally trade off stability and fit, and 
thus, in and of itself, does not produce a satisfactory solution to our problem. 

3 Model 

The fit of output 5 E 5 to input x £ X is measured by a fitness function cf) : S x X ^ R. The 
best-fit OPT(ic) is the output with maximum fitness for input x: 

OPT(a;) = argmax0(S', a;) . 
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Output domains S and corresponding best-fits can be assignments (minimum cost assignment), 
routing tables (shortest path routing), subsets of size k (top-fc set), spanning trees (MST), or 
sampling distributions (PPS sampling). The changeout cost from output Si to S2 is a distance 
function on S x S: d(S'i, ^2) > 0. 

When presented with a new input x, and a current output S' , we are interested in a A.- stable 
optimum, which is the best-fit output subject to a limit D on changeout cost, and its respective 
fitness: 

A-OPT(S'',£C,i:)) = arg max MS.x) 

S\d(S,S')<D^ 

A-4>(S',x,D) = max (t)(S,x) 
S\d{S,S')<D 

When fixing x and 5", and varying the allotted change D, we obtain the optimal tradeoff between 
fit and changeout cost using the points (D, A-(f){S' , x, D)). 

The a-stable optimum S is defined with respect to a parameter a, which specifies a relation 
between changeout cost and improvement in fitness: 

a-OPT(5', X, a) = arg max (0(5*, x) - a d{S, S')) (1) 
a-4>{S', X, a) = (j){a-OPT{S' , x, a),x) 

By sweeping the Lagrange multiplier a, we can obtain the tradeoff using the points 

{d{S',a-OPT:{S',x,a)),a-(p{S',x,a)) . (2) 

The fitness of the a-stable output decreases with a and that of a A-optimal one increases with 
D. For a = 0, we allow D to be arbitrarily large, and the a-stable optimum is the current best-fit 
OPT(a;). When a is sufficiently large, D is forced to be small and the a-stable optimum is closer to 
the previous S'. Two desirable properties satisfied by the problems we consider are concavity and 
output monotonicity of the tradeoff. These properties simplify the computation of the tradeoff and 
also mean that the choice of the best tradeoff point is predictable and tunable. 

Concave tradeoff: Concavity means that the function A-(f){S' ,x, D) is concave in D, so that the 
marginal improvement in fitness for increase in allowed changeout cost is non-increasing. When 
the function is continuous and differentiable in D, the marginal improvement is the first derivative 
^^-'t>{S^^,D) concavity means ^ ^"gip'"'''^'' < 0. In this case, when the parameters satisfy 
A-OPT(S', x, D) = a-OPT(S', X, a), then a = ^^i^^^. 

Output monotonicity: When the output can be expressed as a vector, monotonicity means that 

each entry (inclusion, sampling probability, etc.) is monotone when sweeping a and looking at the 
a-stable optimum (or equivalently, when sweeping D and looking at the A-stable optimum). In 
particular, when output entries are binary, then for each i G S, there is at most one threshold value 
T so that i ^ a-OPT{S, x, a) a < t and for each i ^ S there is at most one threshold value of 

T so that i G a-OPT(S', x, a) a < t. 

Problem formulations 

The applications we target have a sequence of inputs. A natural aim is to seek outputs that are a- 
stablc in each step with respect to the same parameter a — meaning that we "price" the changeout 
cost the same across steps. Alternatively or additionally, we may need to limit the changeout at 
each step. 
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Stable solutions and tradeoff: The underlying basic algorithmic problem is to compute, for a 
given input x and current output S, the respective A-stable and the a-stable optimum solutions. 

Incremental updates: When in each step the input is provided as a small update to the previous 
one, such as a rewrite, increment, or decrement of a single entry of the vector, we would like to 
efficiently compute a new a-stable optimum instead of applying the batch algorithm to compute it 
from scratch. 

Offline problem: In the related offline problem, the input is a sequence x^^'^ G (j G [N]) of 
states and the solution is a sequence Sj £ S {j £ [N] ) of outputs that maximizes 

^0(5„^(^))-a^d(5„5,_i) . (3) 

ie[7V] iG[7V] 

So is defined to be some initial state, or 0. To maximize stability (q — )• oo), the solution is Sj = S 
for all j G [N] where S = argmax^g^ J2jelN] 4'{'S,x^^^j^ For maximum fit, (a = 0), we use the 
best-fit at each timestep, Sj = opt{x^^^). This offline stable optimum is interesting when we seek 
to minimize global change and are willing to tolerate worse fit occasionally in some steps. We would 
then like an online solution which is closer to the offline stable optimum, and possibly use a model of 
the data to achieve this. Our formulation of computing a stable solution at each step is appropriate 
when the aim is to guarantee a continuously good fit, as in the motivating scenarios outlined in 
the introduction. Note that our model has no prediction component, which fundamentally sets it 
apart from other treatments of adapting to a changing or unknown state including metrical task 
systems [3j and regret minimization in decision theory. 

4 Stable PPS sample 

In this section, we consider the stable extension of PPS sampling. The input domain X consists 
of real non-negative vectors w £ i?" and the output domain S contains probability distributions 
specified by all vectors p G [0, 1]" such that Y17=i Pi ~ ^- While we modify the distribution, the 
actual output is a random sample where entry i is sampled with probability pi and different entries 
are sampled independentljj^ 

Fitness function: When a value Wi is sampled with probability pi, the unbiased non- negative 
estimator with minimum variance on the weight of the entry is the Horvitz-Thompson estimator, 
i.e. ^ when the entry is sampled and otherwise. The variance is wf{l/pi — 1). 

We measure the quality of a sample distribution p for input w by the sum of variances 
'^iwf{l/pi — 1). We can omit the additive term —J2i''^i (which does not depend on p) and 
obtain the fitness function ^ 

<p{p,w) = -V— . 

The best-fit for a sample of expected size k is the distribution with minimum sum of variances. 
This is achieved by PPS (Probability Proportional to Size) sampling probabilities, p = pps{k,w), 
obtained by solving for r the following equation, using pi = min{l, tUj/r}: 

minjl, Wi/r} = k . (4) 



^In general, this is not equivalent to the original optimization problem applied to the combination of all inputs. 
^We comment that it is often sufficient to only require limited independence, which is easier to obtain in practice. 
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Algorithm 1 SuBSAMPLE procedure 
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procedure SuBSAMPLE(S',p, qr) 
for i G [n] do 

if i ^ S and qi > pi then 
if rand() < 3iizPi then 
S ^ SU{if^ 

if i G S and qi < pi then 
if rand() > 2i then 
S ^S\{i} 

return S 



> i is sampled-in with probability 



> i is sampled-out with probabiUty 1 — 2i 

Pi 



Distance function: The random sample is a set of entries and our distance function measures 
the expected set difference between the samples. The distance therefore depends on the particular 
subsampling procedure we employ to move from a sample S drawn from probability distribution 
p to a new one drawn from q. Clearly, the distance must be at least ||p — q||i = J2i \Pi ~ Qi\- 
We can use a Subsample procedure (pseudo-code in Algorithm [T]), which has distance exactly 
dip, q) = — and is therefore minimal. Subsampling results in input and output samples that 
are coordinated, meaning they are as similar as possible subject to each conforming to a different 
distribution. It is possible to modify Subsample to use a VarOpt sampling procedure [5], which can 
ensure that the number of inclusions and exclusions are each between the floor and ceiling of their 
expectation. Alternatively, we can associate a permanent random number (PRN) [3] ixj S C/[0, 1] 
so that i E S Ui < Pi. This means that each sample is drawn from the right distribution, 

while the samples are coordinated across time steps. 



Example 4.1 Consider sampling a set of expected size 2 from 6 entries. Suppose we have a current 
sample S, drawn according to a uniform distribution p, where pi = 1/3 for i G [6]. This distribution 
has the best fit when the input is uniform. 
Suppose that the new input weights are 

w = (2,4,1,5,6,0) . 

The PPS sampling probabilities for a sample of size 2, obtained with threshold r = 9, are 

/2 4 1 5 2 \ 
pM2.») =(^.^.3.^,5,0). (5) 

The expected changeout cost of modifying the sample from one drawn according to the uniform 
p to one drawn according to pps{2,w) is \\p — pps{2,w)\\i = |. 

The actual changeout to the sampling distribution pps{2, w) depends on our initial sample S 
and the randomization when subsampling. Suppose S = {3, 4}, that is, contains the third and fourth 
entries. Entries {1,6} are not in S and have reduced sampling probabilities and therefore will not be 
included in the new sample. Entry 4 is in S and has increased inclusion probability and therefore will 
be included in the new sample. Entries {2, 5} which are not in S but for which inclusion probabilities 
increased, need to be sampled for inclusion with respective probabilities {1/6,1/2}. Entry 3 which 
is in S but has a reduced inclusion probability is sampled out with probability 2/3. Therefore, the 
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new sample will include entry 4 and will not include entries {1,6} but may include any subset of 
entries {2, 3, 5}. 

The stability-fit tradeoff includes distributions that are between p and pps{2,w). These distri- 
butions have higher variance but smaller expected change. In the sequel, we demonstrate how they 
are computed on this example. 

4.1 Computing the stability /fit tradeoff 

In this section we establish the foUowing: 

Theorem 4.1 An a-stable PPS distribution, ^-stable PPS distribution, and a description of the 
tradeoff mapping parameter values to fitness and vice versa, can be computed in 0{n\ogn) time. 
Moreover, the tradeoff is concave and monotone. 

Let p be the initial output and w the current weights. The A-stable distribution q is the 
solution of the following nonlinear optimization problem: 

minimize — (6) 

i * 

Vi, 1 > > 

i 

^ \qi-Pi\ < D 

i 

We are interested in efficiently solving the problem for a particular value of the changeout D 
and also in computing a compact representation of all solutions. 

We reduce ([6| to two simpler problems. Best-increase computes, for any value x, the most 
beneficial total increase of x to current sampling probabilities, i.e., that would bring the biggest 
reduction in the sum of variances. Similarly, Best-decrease computes the least detrimental total 
decrease of x, i.e., the decrease that results in the minimum increase of the sum of variances. Both 
problems can be formulated as convex programs. We show that descriptions of both functions can 
be found in time 0(n log n). Last, we show how to fully describe the A-stable PPS distributions 
and a-stable distributions in time O(nlogn) by using these functions. 

The most beneficial increase 

The most beneficial increase for x £ (0, ~ Pi)] is S which solves the following convex program. 

minimize (7) 

^Pi + Si 

Vi, 1 — Pi > (5j > 

y^^5i = X 

i 

Lemma 4.2 For any increase x G (0,X]i(l ~ Pi)] there is a threshold value y = t{x) such that 
the solution of ^ is 6i = min{l, max{pj, ifj/y}} — Pi. The function r(x) and its inverse A"'"(y) 
are continuous and decreasing and piecewise with at most 2n breakpoints, where each piece is a 
hyperbola. Moreover, a representation of both functions can be computed in O(nlogn) time. 
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Algorithm 2 The function A+ (y) and its inverse t{x) for w and p 



1: procedure Best-Increase('U7,p) 

2: W t> weight of active items 

3: £) > Contribution to difference of items that exited and negated sum of initial 

probabihties of active items 

4: L ^ > Initiahze L as an empty max-heap 

5: for i G [n] do 
6: if {wi > 0) then 

7: if {pi<l) then 

8: {A.v,A.t,A.i)^{^,"entvy\i) 

9: value, breakpoint type, item 

10: L LU A > add entry record to L 

11: else if Pi = then 

12: W + Wi 

13: {A.v, A.t, A.i) ■(- {wi, "exit" , i) 

14: > value, breakpoint type, item 

15: L L U A i> add exit record to L 

16: A GetMax(L) 

17: > Pull Record with largest A.v in L 

18: i{ W >0 then 

19: A+^{([At;,oo),f)} 

20: > "Rightmost" piece of Function A+ (y) 

^^{((0>S.f)} 

22: i> "Leftmost" piece of Function t{x) 

23: else 

24: A+ <(- 0, r 

25: r ^ ^.u > definition boundary so far of A"*" 

^ ^ X^; ^ definition boundary so far of r 

27: while L 7^ do 
28: A ^ GetMax(L) 

29: i <— A.i 

30: if A.t = "entry" then 

31: W + Wi 

32: D D -pi 

33: else > A is an exit point 

34: W i-W - Wi 

35: D-^ D + 1 

36: A+ ^ A+U {[A.V, r],D+f) 

37: > Function A+ from A.v to next breakpoint 

38: r <— A.v 

39: r^rUmO+^J,^)} 

40: > Function r from previous to current breakpoint 

41: e^D+^^ 

42: return A+, r 



8 



Proof From convexity, the solution is unique and the Kuhn- Tucker conditions specify its form: 
The partial derivatives 



dSi {pi + 6, 



\2 



are equal for all i such that 1 — pi > 5i > We denote this common value of ^j^rpy: by zix)- For i 
where 5i = 0, we have ^ < t_{x) and when < (5i = 1 — pi, we have Wi > t_{x)- 

Using this, we can now consider the solution as a function of x. We first sort all items with 
< Pi < 1 in decreasing order of Wi/pi and place items with = and Wi> Q ai the head of the 
sorted list. The list is processed in order: we first increase 5i until ^^^q^ = We then increase 
both 6i and 82 so that the ratios are all equal until p^^^^ = p^^s2 ~ ^ ^^'^ sto\) 
increasing a probability once it is equal to 1). At any point, the ratios — on the prefix 1, . . . , /i 
we have processed are all equal to some value y except for items where pi reached 1. The threshold 
y = z{^) can be interpreted as the rate of decrease of the sum of variances per unit increase in 
expected sample size after implementing optimally a total increase of x. It is a decreasing function 
of X and satisfies y € {wh+i/Ph+i-iWh/Ph]^ where h is the last processed entry on our sorted list. 
The new sampling probabilities q = p + S satisfy qi{y) = min{l, max{pi,Wi/y}}. Observe that the 
probabilities q{y) are PPS of size Yli=iPi + x oi the prefix of the list 

h 

{qi,...,qh) ^ ppsC^pi + X, {wi, . . . ,Wh)) , 

i=l 

and y is the threshold value of these PPS probabilities. 

Consider the (at most 2n) points defined by the ratios Wi/pi (when pi < 1 and Wi > 0) and the 
weights Wi. Consider the relation between x and y = t(x). The function is a hyperbola x = a + b/y 
for some constants a, h between consecutive breakpoints. 

The inverse function of t{x), which we call A"^(y), maps threshold values to probability in- 
creases, and satisfies 



X 



^^(y) = "^(liiy) - Pi 

i 

= minjl, maxfa, -Wi/z/}} - pi . 

i 

It is continuous and decreasing with range [minj|p.^]^ liJj, i?), where i? = c« if 3i with u)j > and 
Pi = and R = maxj|p,<x ^ otherwise. It is piecewise where each piece is a hyperbola. 

Algorithm [2] computes a representation of the functions A"'"(y) and r(x) each as a sorted list of 
pieces of the form (/, f{y)) where / are intervals that partition the domain and / is the function for 
values on that interval. For A"'"(y), the pieces have the form f[y) = a + h/y. The inverse function 
r(x) is simultaneously produced by the algorithm by inverting each piece (/, a + h/y) to obtain a 
respective interval where end point f; of / is mapped to end point f{v) = a + b/v (left end points 
are mapped to right end points and vice versa). The inverse function on the corresponding interval 
is r{x) = f-\x) = 

For each i with w-i > and pi < 1, Algorithm p creates an entry point with value ^ and for 
each item with positive Wi, creates an exit point wiui value Wi. These (at most 2n) points are the 
breakpoints of A~^{y). The breakpoints are processed in order to compute the functions A^{y) and 
r. The computation is performed in linear time after sorting the breakpoints. ■ 
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Example 4.3 Returning to our example (Example we consider the best increase for w and 

the uniform p. Since probabilities are uniform, Algorithm^processes entries in order of decreasing 
weight. The first entry on our list is entry 5, which has weight of 6. The best increase is applied 
only to this entry until the ratio matches the ratio of the second entry on the list, which happens 
when 6/(1/3 + 6) = 5/(1/3), that is, when 5 = 1/15. So the first piece of our function is for the 
interval x G (0, 1/15]. The corresponding distribution has entry 5 with probability 1/3 + a; (and all 
other entries with the original l/3j. The algorithm has active weight W = 6 and thus produces the 
threshold function r = 6/(1/3 + x). We can verify that the inclusion probability for changeout x 
is (75 = 6/t(x) = 1/3 + X. The second piece of the function associates increases for both entry 5, 
and the second entry on our list, which is entry 4. They are increased until the ratios are equal 
to the third entry on our list, which is entry 2. This happens when the increase to entry 5 is 
1/6, according to 6/(1/3 + 5) = 4/(1/3) and when the increase to entry 4 is 1/12 according to 
5/(1/3 + 5) = 4/(1/3). The total increase from entries {4,5} is 1/6 + 1/12 = 1/4 and we obtain 
that the second piece of the function is for x S (1/15, 1/4), and both entries {4,5} are increased. 
Algorithm^ computes the active weight = 6 + 5 and the threshold r(x) = ll/(a; + 2/3). Thus 
for an increase of x, entry 5 is sampled with probability 6/r(x) = (6/ll)(x + 2/3) and entry 4 with 
probability 5/t(x) = (6/ll)(x + 2/3). The third piece would have 3 active entries {2,4,5} (the three 
largest entries.). The active weight isW = \'o and the threshold is accordingly t_[x) = 15/(x + 1). 
The interval for this third piece is (1/4,2/3]. 

The least detrimental decrease 

We next address the complementary problem, of how to compute the best decrease to the proba- 
bilities. For X G [0, X^j^'di it is the solution of the optimization problem 



minimize (8) 

•I -P* ~ * 

Vi, Pi > (5i > 
= X 

i 

We first observe that any total decrease x < Dq, where Dq = Ylii\wi=oPi arbitrarily 
applied to probabilities of items with zero weight and does not increase the sum of variances. 

Lemma 4.4 For any decrease x G {Dq, ^^Pi\, there is a threshold y = t(x) such that the solution 
of ([s]) has the form 6i = pi—mm{pi,Wi/y}. The function t{x) and its inverse A^{y) are increasing, 
continuous, and piecewise with at most n breakpoints where each piece is a hyperbola. Moreover, a 
representation of both functions can be computed in O(nlogn) time. 

Proof For any decrease x G {Dq, J2iPi\^ solution is unique and specified by the Kuhn- Tucker 
conditions. For all i where Wi = we must have 5i = pi. For all i such that 6i £ {0,pi), the partial 
derivatives 

Cp^_S. _ wf 



d5i {pi - (5j)2 

are equal and we denote the common value of ^.^^ by r(x). When 6i = 0, we must have ^ > r(x) 
The solution thus has the form 

6i=pi- mm{pi, Wi/y} = max{0,pi - Wi/y} . 
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-Pi 



while L / do 

A ^ GetMin(L) 



W 
D < 

A- 
r 

f i- 



A.i 

-W + Wi 
-D+pi 
^ A 



> Initial active weight W 



Algorithm 3 The function A (y) : (0,oo) and its (extended) inverse r : (0, X^jPi) for w and p 
1 
2 
3 
4 
5 
6 
7: 
8: 
9: 
10 
11 
12 



procedure Best-Decrease(i(;,p) 
W ^0 
D ^0 

A- ^ 0, T ^ 
for i G [n] do 

if {wi > 0) then 

{A.v,A.i) ^ C- 
L ^ LUA 
else, D -i^ D + Pi 

T^{{0,D],0) 
£ = D 
r = oo 



> value, item 
add entry record to L 
> Wi = 

> Threshold function is for y < D 
> Left boundary of current definition of r 
> Right boundary of current definition of A^^ 



> Pull out record with minimum A.v 



U{[A.v,r),D-f) 
w 

A.v 



D 

rU(5:Z)-^],^) 



D 



w 

A.v 



23: return A , r 



We obtain the relation 

x = ^5i= ^ {pi-Wi/y), 

i i\wi/y<pi 

which defines the functions y = t{x) and its inverse x = A~(y). Both t{x) and its inverse A~(y) are 
continuous and decreasing. The range of t{x) (domain of A.~{y)) is [imni\p.yQWi/pi,Qo). Both are 
piecewise with at most n breakpoints which correspond to the values of the ratios Wi/pi. Algorithmjs] 
computes a representation of these two functions. It first orders the items as in Algorithm [2] and the 



proof of Lemma 4.2 , according to the ratios Wi/pi and processes this list in reverse order, generating 
the pieces of the two functions. Each function is represented as a list of consecutive pieces of the 
form (/, /(y)), where / is an interval. For convenience, we define t{x) = for x < Dq. For A~(y), 
/(y) has the form /(y) = a — b/y. The algorithm inverts /(y) to obtain the corresponding piece of 
the function r(x): an end point t> of I is mapped to end point a — h/v (left end points are mapped 
to right end points and vice versa) and the function is b/{a — x). 

The probabilities q ^ p — S computed for a given x are PPS with threshold is y: 

n 

{qt, ...,qn)^ PPs(^Pi - X, {wt, . . .,Wn)) ■ 
i=t 
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Figure 1: The thresholds r and r as a function of changeout x for Example 4.1 
with maximum changeout (4/3) is provided as a reference. 



PPS threshold 



Example 4.5 We compute the best decrease for Example o,s computed by Algorithm^ We 
obtain that for x G (0, 1/3] we only decrease the probability for entry 6, which has weight ofO. We 
then decrease only the second smallest entry, which is entry 3 with weight 1. The interval of the 
second piece is accordingly x G (1/3, 1/2], we have active weight 1, t{x) = 1/(2/3 — x), and ^6 = 
and q2 = 1/t(x) = 2/3 — X. The third piece x £ [1/2,2/3] involves entries {1,3} . The active weight 
is W = 3 and t{x) = Accordingly, the probabilities are qq = 0, q2 = 2/t{x), and qi = 1/t{x). 

Computing a A-stable PPS distribution 

We consider computing A-Opt{w,p, D), which is the distribution with maximum fitness (min- 
imum variance) (j){q,w), amongst q that are within distance D from p {\\p — q\\i < D). The 
applicable values of D are in the interval [0, ||pps(A;, w) — p||i] and clearly A-Opt(id,p, 0) = p and 
A-Opt(id,p, ||pps(/i;, id)— p||i) = pps(fe, w). For general D, we obtain the distribution by combining 
the best increase of D/2 and the best decrease of D/2. The two are disjoint for D < ||pps(A;, w)— 
Pseudocode is provided in Algorithm |4| 



Example 4.6 On Example 4-1 ■ the optimal PPS has changeout 4/3 and thus we are interested 
in the A-stable distribution for D < 4/3. We do this by combining the best increase of D/2, 
which affects a prefix of the entries {5,4,2} with the best decrease of D/2, which affects a prefix 
of the entries {6,3, 1}. If D = 1, the best increase of x = 0.5 involves all three entries. We have 
T = 7.5 and accordingly the entries {5,4,2}, which have weights {6,5,4}, have q^ = 6/7.5 = 0.8, 
54 = 5/7.5 = 2/3, and q2 = 4/7.5 = 8/15. The best decrease of x = 0.5 has r = 6. Accordingly, 
qe = 0, qs = 1/6, and qi = 2/6 = 1/3. Therefore, the A-stable distribution for D = 1 is 

18 12 4 
3' 15'6'3' 5'° 

Figure [I] illustrates the thresholds r and t as a function of the changeout x. When the changeout 
is D = 4/3, the increase and decrease are D/2 = 2/3 and both thresholds meet the PPS threshold, 
reflecting the fact that the A-stable solution is the best-fit PPS ([s]). 



Lemma 4.7 A representation of A-Opt{w, p, D) can be computed in 0{nlogn) time. The repre- 
sentation supports computing the output distribution for a particular value D in time linear in the 
number of entries with Pi ^ qi. 



12 



Algorithm 4 Compute the distribution A-Opt{w,p, D) 



function OPT{w,p, D) 
X ^ D/2 
T ^ t{x) 
T ^ f ix) 

if T = then x 
for i G [n] do 

if Pi < ^ then 

HI ^ 'P 

else if T = then 

if > and = then 

qi ^ max{0,pj — K\ 
R ■(^ R- Pi 

else if Pi < Sj. then > T > and pi G [|^, 

else > T > 0, p, e [f , f] 

Qi ^ Pi 

return q 



Proof Algorithm [4] computes A-Opt{w, p,D) in 0(n) time after computing the thresholds f{x) 
and r(x) for x = D/2, using Algorithms [2] and [sj If we are interested in a particular D, the 
algorithms can be stopped once the piece of the function containing x = D/2 is computed. This 
takes 0(n log n) time of Algorithms [2] and [sj The complete representation is the combination of 
the representations of t(x) and f{x). If it is precomputed, Algorithm |4] returns the distribution in 
time linear in the number of modified entries. ■ 



Computing an a-stable PPS distribution 

For a given stable solution, we can consider its changeout D from the starting distribution and the 
Lagrange multiplier a, which is the rate of decrease of the sum of variances per unit change after 
implementing a total change of D in sampling probabilities. The relation between the two is 

a{D) = a = T{D/2f - f{D/2f. 

Lemma 4.8 A description of the function a{D), and a solution for an equation a{D) = a, can he 
computed in O(nlogn) time. 

Proof The function a{D) is a piecewise rational expression which is decreasing with D. Its 
breakpoints are the union of the breakpoints of r and r. For each piece, we can compute the 
respective range of D values and of a values. To compute the a-stable distribution for a given a, we 
locate the piece in which a is in the range and solve for x the equation a = T(a^)^ — f{x)'^. Within 
a specified piece, this yields a quartic equation (finding the root of a polynomial of degree 4), and 
we take the root x that lies in the domain of the piece. 

Once x is computed, we have the changeout D = 2x which corresponds to a and the thresholds 
r(x) and f{x). We can substitute the thresholds in the initialization of Algorithm |4] and apply it 
to compute the a-stable distribution. ■ 
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Combining these pieces provides the result claimed by Theorem 4.1 
4.2 Incrementally maintaining the stable PPS distribution 

We now show how we can efficiently maintain an a-stable distribution when modifications to the 
weight vector are made one entry at a time. Consider an a-stable distribution and the set of ratios 
Wi/pi. Following the notation we used in the batch case, let r be the largest ratio with pi < 1; and 
let r be if there is an entry with zero weight and positive probability and otherwise the smallest 
ratio with Wi > 0. The difference of the squares is at most a: — t"^ < a. When the weight of an 
entry i is modified, its ratio changes. If this results in a maximum difference of squares exceeding 
a, we need to recompute the a-stable distribution. This event can happen only when either pi < 1 
and Wi/pi > T or Pi = 0, Wi > and r > or Wi/pi < r. The modification that involves the 
modified entry itself must be at one end of the ordered ratios list and a sequence of entries from 
the other end of the ordered ratios. We increase r to simultaneously compensate for decreasing 
r or vice versa, but noticing we never need to exceed the original threshold value at the affected 
end. The total number of modified probabilities, however, can be large, even when we modify the 
weight of a single entry. On the other hand, the expectation of the change in the sample (the Li 
distance between the initial and final a-stable distributions) is at most 1 and we can ensure (as we 
outlined in the subsampling discussion) that there is at most one insertion and one deletion from 
the sample. 

Our treatment of the incremental problem introduces a representation of the a-stable distri- 
bution, which allows for efficient tracking and modifications of sampling probabilities of sets of 
entries and also efficient maintenance of a corresponding sample. This is combined together with 
amortized analysis which charges the modifications to the representation to previous modifications 
in the input. We show the following (the details are deferred to Appendix [A|) : 

Theorem 4.2 An a-stable PPS distribution, and a corresponding sample, can be maintained in 
amortized polylog(n) time per modification. 

5 Additive fitness and changeout cost 

We now describe a general approach for a broad class of optimization problems. We say an opti- 
mization problem with input domain of the vectors X C i?" and outputs S that are subsets S C [n] 
has additive fitness if its objective is to maximize the fitness function 4>{S,x) = X^jg^Xj: 



OPTfa;) = argmax >^ 



The changeout cost between two outputs is additive if there are constants Cj such that d{S, S') = 
YlieS'\s '^i- "^^^ constant Cj can be interpreted as the cost of "bringing in" i to the output. When 
the optimization problem has additive fitness and changeout cost, the a-stable optimum w.r.t. 
current output S can be posed as a best-fit instance on a modified input vector with an appropriate 
quantity subtracted from entries in i ^ S: 

Theorem 5.1 Let OPT be an optimization problem with additive fitness and changeout. The a- 
stable optimum satisfies a-OPT{S,x,a) = OPT(y) , where yi = Xi — Ii(^sCia and I is the indicator 
function. 
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Proof From ([T]), 



a-OPT(5, X, a) = argmax^ — a a + ^^x, 

^ '^^ i£S'\S i&S' 

= argmax ^^yj = OPT{y) . 



We now consider maintaining an a-stable optimum efficiently under incremental updates. We 
show that this can be done using an off-the-shelf dynamic opt algorithm: 

Theorem 5.2 // fitness and changeout are additive, then when the input is modified, the a-stable 
solution can be maintained by applying a dynamic OPT algorithm with a corresponding number of 
modified entries. 

Proof For an input x and current output S, we consider the vector y such that yt = Xi — li^so-Ci. 
Recall that the a-stable optimum is the best-fit with respect to input y. It therefore suffices to 
translate the modifications in x to the modifications in y. When an entry of x is modified, we apply 
a corresponding modification to the same entry in y. When the algorithm modifies the output, we 
increase by acj the value of new entries inserted to S and decrease by acj the value of entries that 
are deleted from S. The total number of updated entries in y is the sum of the updates to the 
input and the set difference between the a-stable outputs. ■ 



Fixed-size output and uniform costs 

In this section, we study problems that possess additional structure, namely fixed-size output, 
\/S G S, \S\ = k and uniform costs Ci = 1. In particular, this formulation applies to matroids. 
We can equivalently treat changeout costs as those of excluding entries currently in the output. 
Examples of such problems include Top-fc, MST, and assignment, which we study individually 
below. 

Specializing Theorems 5.1 and 5.2, the a-stable optimum can be computed by applying opt to 
a vector which adds a fixed value a to entries in 5: a-OPT(S', x, a) = OPT(y) , where yi = Xj -h/jg^a: 

Q-OPTfS, X, a) = arg max ( — alS" \ 51 -|- x, 
S'es \ ^-^ 

= arg max ( a| 5' Pi 51 -|- > x, 
S'es \ ' 

i&S' 

= argrnax yi = OPT{y) . 

^ ^'^ i&S' 

Since y > x, our reduction can also be used when the domain of OPT is non- negative. 
We can now understand the structure of the objective of a-OPT as a function of a: 

Lemma 5.1 The function max^/g^ ( X^jg5/ Xj + a|5 H 5'|) is the maximum of at most k linear 
functions. Thus, it is convex and piecewise linear with at most k breakpoints. 
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Proof The objective is a (restricted) parametric version of the original optimization problem |15| 
[T3| [T^ [HI [To]. In a solution with respect to a parameter a, the weight of each entry is replaced 
by the linear function Xi + a when i £ S and is fixed to Xi otherwise. The value of a particular 
output S' with respect to S and a is X^jg^/ Xi + a\S Ci S'\. The optimal solution as a function of 
a is the maximum over all 5' G S. These are linear functions with at most k + 1 different slopes, 
corresponding to the intersection size IS" H S'\. For each intersection size {0, . . . , k}, there is one 
dominant linear function. The maximum of linear functions with slopes {0, . . . , A;} is such that each 
function dominates in at most one piece and the slopes are increasing. ■ 



If we explicitly compute a representation of the objective function, we can obtain a A-stable op- 
timum, by identifying a corresponding value of the parameter a: The breakpoints of the function 
correspond to decreasing changeouts (increasing slopes) from changeout k (slope 0) to changeout 
(slope k). The marginal gain in fitness from each unit increase in changeout is the respective value 
of a at the breakpoint. Hence: 

Corollary 5.2 When the problem has additive fitness, additive and uniform changeout, and fixed- 
size output, the tradeoff is concave. 

There are at most A; + 1 different a-stable solutions (eliminating duplications) and thus all can be 
specified in 0{k'^) space. When the tradeoff is monotone, then this description requires only 0{k) 
space. This is because there are at most 2k entries involved in total, and each has a single exit or 
entry point as we sweep the parameter. We will show that monotonicity holds for Top-k and MST 
but not for assignment. 

5.1 Stable top-k 

The top- A; problem has input domain X C -R" so that Xi is the 'weight' of i. The permitted outputs 
S are all subsets S C [n] of size k, and the goal is to maximize (p, the sum of weights of the selected 
subset. The problem satisfies the conditions of Theorem [5. 1| and therefore the a-stable top-A: with 
respect to input x and current set S is a-top-A;(S', a;, a) = top-k{y), where yi = Xi + lieso-- 

Lemma 5.3 The stability-fit tradeoff for stable top-k is monotone and can be found in 0{n+k log A;) 
time. 

Proof Consider the sequence of swaps made by a-top-k(S', x, a) while sweeping a. For sufficiently 
small a > 0, a-top-k(S', a;, a) = S. For sufficiently large a, a-top-k(S', a;, a) = top-k(a;). The total 
number of swaps is the set difference d{S,iop-]<.{x)). The swaps are between the lightest i £ S and 
the heaviest i ^ S. 

To compute all tradeoff points, we sort the entries in S by increasing Xj to obtain the order 
(ii, . . . , i/;), and sort the top-A: entries i ^ S in decreasing order (ji, . . . , j^). This obtains the stated 
time bounds. Let H = argmax/jXi^ < Xj^. For h < H, the most beneficial swap of at most h 
items, swaps-in items ji, . . . ,jh and swaps-out items ii, . . . ,ih- The a-stable swap uses the maxi- 
mum h < H such that Xj^ — > a. The tradeoff is clearly monotone, as each entry is swapped 
in or out at one particular value. ■ 



Incremental algorithm: The problem also satisfies the conditions of Theorem 5.2 and therefore 
to obtain an incremental algorithm, we outline a simple dynamic algorithm for maintaining a top-A; 
set. We use two heap data structures. The first heap is a min-heap of the k cached entries (we call it 
the "in- heap" ) . The second heap is a max- heap of the remaining n — k entries ( "out-heap" ) . Weight 
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updates arc performed as updates to the weights of the corresponding elements in the heap. Swaps 
cause the deletion of the minimum item in the in-heap and the maximum item in the out-heap, 
and their reinsertion in the other heap. Swaps are performed when Xi — xj > a, where i is the 
maximum item in the out-heap and j is the minimum item in the in-heap. In this algorithm each 
update has an overhead of O(logn) — note that an update of a single value can result in at most 
a single swap. 

The optimal offline algorithm: We now provide an optimal algorithm for offline stable top-A;. 
The offline problem is to maximize 

j i&Sj j 

We can obtain the offline stability-fit tradeoff by sweeping a. The same set of solutions can be 
obtained by sweeping the maximum number of cache swaps K, and maximizing X^ieSj "^1^^ 
subject to J2j I'^j \ ^ ^- For each item i and subsequence C [N], we compute the 

benefit of caching i during the subsequence: 

e 

bije = E ^i"^ ■ 

h=j 

We seek a cover with at most K + k distinct intervals such that there are at most k active intervals 
at each step h and the cover has maximum benefit. This can be formulated as an integer program 
with variables yije G [0, 1]: 

maximize ^kjeyije 
subject to ^ Uiji < K + k 

ijl 

and V/i G [n], E E E - ^ 

j<h, £>h ie[n] 

This problem can also be modeled as min-cost multi-commodity flow. It can also be solved via 
dynamic programming in polynomial time. For maximum stability (K = 0), we simply choose the 
top-fe keys with maximum bi^i^N- 

Example: Consider an input vector x = (1,4,7,5). The top-2 set is the entries {3,4}. Suppose 
k = 2 and at the current output is the first two entries S = {1,2}. The best swap is 1 -H- 3 (swap- 
ping the smallest cached value with the largest uncached value) and the second most beneficial 
swap is 2 -H- 4 (swapping 2nd smallest cached with 2nd largest uncached). The optimal tradeoff 
had 3 points: performing no swap, only the first swap, or both swaps. Looking at a-stable solution, 
the first swap has utility gain 7—1 = 6 and the second swap has utility gain 5 — 4 = 1. Thus the 
first swap is performed when a < 6 and both when a < 1. ■ 

Our results carry over to fitness functions with the more general form 0(5', £c) = X^jgg '0?(a^i), 
by treating ipix) as the input. This generalization allows us to measure fitness by any Lp norm, 
using tp{x) = \x\^. The choice of p matters when processing a sequence of inputs using the same 
a value: Augmenting our example, consider the a-stable outputs for two inputs: x = (1,4,7,5) 
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5+a 



Figure 2: Example demonstrating non-monotonicity of assignment 



and another vector z = (2,3,8,4), both starting with current output {1, 2}. The vector z has the 
same tradeoff curve as x when using the Li norm. If we want to use the L2 norm for fitness, we 



use ip{x) 



In this case, the two most valuable swaps have utility gains of 7 



48 and 



52 _ 42 _ g g^j^^^ §2 _ 2^ = 60 and 4^ — 3^ = 7 on z. Therefore, on the same a, and looking at 
both vectors together, it is possible that only the first swap is performed on the x (60 > a > 48), 
only the first swaps are performed on both (48 > a > 9), the two swaps on x and only the first 
swap on z (9 > a > 7), or all four swaps (a < 7). 

5.2 Stable MST 

In the stable MST problem, the inputs are a set of edge weights and the outputs are spanning trees. 
The fit of a tree is its negated weight. The change cost between two trees is the set difference of 



included edges. The problem clearly satisfies the conditions of Theorem 5.1 hence, an a-stable MST 



can be obtained by adding a to the edge weights of edges in the current output tree and computing 



an MST. Applying Theorem 5.2 , dynamic a-stable MST can be handled using off-the-shelf dynamic 
MST algorithms ^I2j . 



Lemma 5.4 The tradeoff for stable MST is monotone. 



Proof Consider starting from an optimal MST and slowly increasing a, obtaining a sequence of 
a-stable MSTs. An edge is removed when it is the heaviest edge in some cycle. This can only 
happen if this edge is not a member of the output S (because weights of all members of S decrease 
by the same amount and all weights of other edges stay the same). Once this happen, weights of 
other edges in the cycle either keep decreasing or stay the same, so the edge remains the heaviest 
in that cycle and out of the MST. Similarly, an edge is inserted into the a-stable MST when it 
has smallest weight in some cut. Such an edge must be a member of S and its weight remains the 
smallest in its cut as a increases. Thus the tradeoff is monotone. ■ 

The tradeoff can be computed using a parametric MST algorithm but the computation can be 
made more efficient for this special case. 

5.3 Stable assignment 

In the stable assignment problem, the inputs are weights of edges in a complete bipartite graph 
and the outputs are maximum matchings. The problem satisfies the conditions of Theorem |5.1[ 
and hence, an a-stable solution can be computed by adding a to all the weights of edges present in 
the current matching and then finding a maximum-weight assignment (maximum weight bipartite 
matching). We also know there are at most n + 1 distinct stable optimal assignments (assuming 



18 



0.038 
0.036 
0.034 
0.032 

0.03 
0.028 
0.026 
0.024 
0.022 

0.02 
0.018 



EWMA PPS: Indep. 

EWMAPPS: PRN 
Stable PPS: Indep. 

Stable PPS: PRN 




V 



700 



100 200 300 400 500 600 
changeout 

(a) Stable PPS uniformly and noticeably outperforms 
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(b) Similar performance for Stable and EWMA top-fe, 
although computational aspects favor Stable top-fc. 



Figure 3: Tradeoff between stability and fit comparing Stable with Time Decay EWMA. 



a single optimal assignment is used). From Theorem 5.2, dynamic a-stable assignment reduces to 
dynamic maximum bipartite matching |16j . 

However, unlike the previous problems, the tradeoff is non-monotone. Specifically, Figure [2] 
shows a simple example of an assignment problem which is non-monotone in its inclusion of edges. 
It shows a 10 node bipartite weighted graphs (where missing edges are assumed to have weight 
zero). A matching from the previous timestep causes five edges to have their (current) scores 
increased by a. We can find the tradeoff by considering solutions which include different numbers 



of the previous edges, as per Lemma 5.1 The solution that picks none of the previous edges has 
weight 31. The solution that picks one of the previous edges picks the top edge, and has weight 
30 -|- a. The solution that picks three previous edges has weight 24 -|- 3a, and picks the three edges 
with weight 5 + a. Lastly, the solution that picks all five previous edges has weight 15 -|- 5a. The 
cases of two and four previous edges are dominated by these solutions. The a-OPT objective is the 
maximum of the four linear equations: y = 31, y = 30 -|- la, y = 2A + 3a, y = 15 -|- 5a, and each 
dominates the other three for a different range of a. Further, note that the top a edge is included 
for a G (1,3) and not for a G (0,1) and (3,4.5). Therefore, the tradeoff is not monotone in its 
inclusion of edges. 

6 Illustration and Comparison with Time Decay 

We provide a brief performance comparison of Stable PPS and Stable top-k against their time- 
decayed counterparts. 

Experimental Set-up. The data comprises IP flow records aggregated over the 144 ten-minute pe- 
riods over the span of a day, obtained from a large ISP. On average, there were 61, 292 (anonymized) 
keys in each period t, from a total of 2, 516, 594. The results shown are for a summary size of 
k = 1, 000 in each case. 

For the time decayed version we constructed for each key i an Exponentially Weighted Moving 
Average (EWMA) Xi^t of the process Xi^f In EWMA PPS, at each time t, we perform PPS selection 
over the keys present but using EWMA weights, i.e., the weight set {xi^t '■ Xi^t > 0}, and formed 
the Horvitz-Thompson estimate of the true weight as an auxiliary variable, i.e., Xi^t/Pi,t for selected 
items, where pi^t is the selection probability. In EWMA Top-k we selected at each time t the set St 
of k items of highest EWMA weight Xi^t, and as measure of fit computed the corresponding sum of 
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original weights J^ieS, ^ht- 



PPS Sampling: Figure |3(a)| shows the stabihty-fit tradeoffs for Stable PPS and EWMA PPS. 
Fit is measured as the square root of sum of weight estimation variances. Fit and changeout costs 
are averaged over the 144 periods. For EWMA PPS we swept the mean m of the geometric decay 
from 1 (no history) to 64 periods. We computed the A-stable PPS distributions sweeping target 
maximal changeout cost D between and the target sample size k = 1,000. For both methods 
we compared independent (pseudo)random number generation, and permanent random numbers 
(PRN) for each key. 

EWMA PPS with independent random numbers has much larger changeout cost (in the range 
620-650) than the other methods. Using key based PRN with EWMA PPS much reduces the 
change cost. Stable PPS performs noticeably better, achieving very small changeout (around 50 for 
the PRN variant) with relatively little growth in error as changeout is reduced (about 15% growth 
compared with about 70% growth for EWMA). PRN has little impact on variance because, for a 
given input set, marginal variance is unchanged. 

Top-A;: Figure |3(b)| shows stability-fit tradeoffs for Stable top-A; and EWMA top-k, where fit is 
represented by the deficit between the total weight of the top-A; estimate, and the true total top-A; 
weight. Fit and changeout are again averaged over the 144 periods. The performance of both 
are similar, although the stable version does obtain a better tradeoff. Nevertheless, we note that 
stable methods are preferable to EWMA methods for computational reasons: they are generally 
faster, and do not require historical data to be retained to determine the output. Moreover, the 
stability or flexibility parameters can be modified directly at each iteration, whereas under time- 
decay these parameters can be controlled only indirectly through the decay parameters. Changing 
decay parameters requires extensive historical data to be retained, and rescanned for each change. 

7 Stable Extensions of Optimization Problems 



In general, we can consider the stable extension of almost any optimization problem. Theorem 5.1 
shows that for a natural class of problems, a-stability reduces to a single modified instance of the 
original optimization problem. However, there are many other problems which do not admit such a 
reduction. We considered the case of PPS sampling, and showed procedures which find the optimal 
solution. Also of interest are those problems where the original optimization is known to be hard: 



then we seek approximation algorithms for the stable extension (making use of Theorem 5.1 if 
possible) . 

In this section, we provide further examples of problems that naturally admit stable extensions: 

In the 0/1 knapsack problem, items have values Xi and weights Wi, and we have a parameter W. 
The goal is to find a set of items of maximum value, subject to an upper bound of W on the 
sum of their weights, Yli(^s — Under additive changeout cost, this fits the requirement of 



Theorem 5.1, and so any approximation algorithm for 0/1 knapsack provides the corresponding 
approximation for the a-stable version of the problem. 

In shortest-paths routing the output is a tree routing to a single destination. The fitness is the 
(weighted) sum, over nodes, of the path length to the root. The changeout cost between two trees 
is the set difference (viewing trees as directed to the destination) , which corresponds to the number 
of modifications to the routing table for the destination. 

The k-set cover problem is, given a fixed collection of sets, to pick at most k to maximize coverage 
of X. The fitness is the sum of weights of covered items, and we consider the uniform changeout cost 
function. We outline a reduction from an a-stable A;-cover to the optimum A;-cover on a modified 
input. Our modified instance contains all the original items and sets. It also contains a new item 
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that is unique for each set in the current output S. The new items have weights o. We argue that 
a cover is a-stable for a if and only if it is an optimal A:-cover of the modified input. 

7.1 fc-center Clustering 

It is natural to consider the stable extension of various clustering problems. Here, we wish to 
minimize the cost of the new clustering plus the cost of switching out some of the cluster centers. 
In this section, we show how to find a constant factor approximation for the /c-center objective in 
this setting. 

Given the stable extension of any optimization problem, an a-optimal solution can always be 
computed by exhaustive search on possible outputs. A method with smaller search space is to 
first construct an algorithm for an extended version of the optimization problem which allows 
a part of the output to be fixed. This extension is a special case of the stable extension. We 
can then perform a smaller search over all subsets of the current output, and apply the (restricted) 
extended algorithm to each instance. Some problems naturally yield to such extensions: for covering 
problems, we remove items covered by the fixed component and apply the (approximate) covering 
algorithm to what remains. 

A less straightforward case arises in the context of clustering. Here, we want to extend a given 
subset of cluster centers with some additional centers to minimize a particular cluster objective. 
For the fc-center objective, the existing 2-approximation algorithm can be extended to handle fixed 
facilities (this "restricted" extension is also interesting in itself). More formally, for points in a 
metric space m, we have: 

Lemma 7.1 Given a fixed set of centers F and a set of n points P, we can find a set of clusters 
C of size s that guarantees to 2- approximate the k-center objective, i.e. 

max min m(p, c) < 2 min max min m(p, c) 
peP cecuF C:\c\=s peP cecuF 

Proof Here, we adapt the 2-approximation algorithm due to Gonzalez [9]. Starting with C = 0, 
we pick the point from the input that is furthest from the current centers in C L) F, i.e. 

C-^CUjargmax min m(p,c)} 
p&p ceCuF 

and iterate until \C\ = s. This requires 0((|F| + s)n) distance computations. 

To see the approximation guarantee, consider the next point q that would be added to S if we 
continued the algorithm for one more iteration. This is some distance d from its closest center in 
C D F. Therefore, we have that all points in C U {q} are separated by d, and further that every 
point in CU{q} is at least d from every point in F. This can be seen by contradiction: if any point 
in C were closer than d, then it would not have been picked as the furthest point ahead of q. 

This provides a witness for the cost of the clustering. There are two cases: (i) some point in 
C U {q} is assigned to some center in F under the optimal clustering; or (ii) some pair of points in 
CL){q} are assigned to the same center under the optimal clustering. If (i) holds, then immediately 
we have that the cost of the optimal clustering is at least d. If (ii) holds, then we have two points 
separated by distance d in the same cluster. Then the cost of the optimal clustering is at least 
d/2, which happens when their cluster center is distance d/2 from each (it cannot be closer to both 
without violating the triangle inequality). 

On the other hand, since q is the furthest from its closest center in C L) F, it follows that all 
points are within d of their closest center, and therefore the cost of the clustering found by the 
algorithm is d, which is at most twice the optimal cost. ■ 
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This indicates that we can find a 2-approximation for the stable extension by considering all 
subsets of the current clustering, and applying the above lemma. However, for subsets larger than 
A;/2, we can simply work with the (approximate) solution that switches out all k centers for the 
optimal fc-centcr clustering on the current set: this at most doubles the changcout cost, while 
improving the clustering cost. Thus, we only need to consider subsets of size at most k/2. 

8 Concluding Remarks 

We have introduced and formalized a model of stability and fit, where the benefits of an improved 
solution have to be traded off against the costs of changing from the current status quo. From 
this motivation, we have provided stable extensions of sampling and optimization algorithms, and 
demonstrated their ability in practice. 

Our work was inspired by the need to improve the performance of an application management 
system. Our ongoing work is to extend our comparison to these scenarios, and evaluate against 
previously deployed ad-hoc approaches. 

References 

[1] NIST/SEMATECH e-Handbook of Statistical Methods. 2012. 

[2] P. K. Agarwal, D. Eppstein, L. J. Guibas, and M. R. Henzinger. Parametric and kinetic 
minimum spanning trees. In FOCS, 1998. 

[3] A. Borodin, N. Linial, and M. E. Saks. An optimal on-line algorithm for metrical task system. 
J. ACM, 39(4):745-763, 1992. 

[4] K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single popu- 
lation. Australian Journal of Statistics, 14(3):231-239, 1972. 

[5] E. Cohen, N. Duffield, C. Lund, M. Thorup, and H. Kaplan. Efficient stream sampling for 
variance-optimal estimation of subset sums. SIAM J. Comput., 40(5), 2011. 

[6] E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. J. Algorithms, 59:19- 
36, 2006. 

[7] P. M. Fenwick. A new data structure for cumulative frequency tables. Softw. Pract. Exper., 
24(3):327-336, 1994. 

[8] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm 
and applications. SIAM J. Comput., 18:30-55, 1989. 

[9] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Com- 
puter Science, 38(2-3) :293-306, 1985. 

[10] D. Gusfield. Parametric combinatorial computing and a problem of program module distribu- 
tion. J. Assoc. Comput. Mach., 30:551-563, 1983. 

[11] J. Hajek. Asymptotic theory of rejective sampling with varying probabilities from a finite 
population. The Annals of Mathematical Statistics, 35(4):1491-1523, 1964. 

[12] J. Holm, K. dc Lichtenberg, and M. Thorup. Poly-logarithmic deterministic fully-dynamic 
algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. J. ACM, 
48(4), 2001. 



22 



[13] R. M. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 
23:309-311, 1978. 

[14] R. M. Karp and J. B. Orhn. Parametricshortestpath algorithms with an application to cyclic 
staffing. Discrete Applied Mathematics, 3:37-45, 1981. 

[15] N. Megiddo. Combinatorial optimization with rational objective functions. Math, of Oper. 
Res., 4:414-424, 1979. 

[16] P. Sankowski. Faster dynamic matchings and vertex connectivity. In SODA, pages 118-126, 
2007. 

[17] R. E. Sleator, D. D.and Tarjan. Self-adjusting binary search trees. J. ACM, 32(3):652-686, 
July 1985. 

A Incremental stable PPS 



Section 4.2 provided an overview of our approach to incremental ("dynamic") maintenance of an 
a-stable PPS distribution under modifications of the weight of a single entry at a time. 

We now provide full details of the data structures we use. To gain intuition, we first consider the 
special case (a = 0) of maintaining a PPS distribution. Conceptually, entries are maintained in a 
sorted order according to their weight and we also track the respective r value. We use a structure, 
such as Fenwick tree [7], that supports prefix sums, lookups, insertions and deletions in O(logn) 
time. When an entry is modified, it is deleted from its previous position and reinserted with its new 
weight. If the weight decreased (increased), the threshold can only decrease (respectively, increase). 
The respective new r (new solution of (|4])) can be computed with a logarithmic number of lookups. 
An actual sample can be maintained using permanent random numbers (PRN). This is facilitated 
by maintaining another sorted order according to the ratios Wi/ui (where Ui is the PRN of entry 
i). The sample includes all entries with Wi/ui > t. For this, we use a data structure that supports 
insertions, deletions, and lookups (of the smallest ratio that is larger than a value) in O(logn) time 
such self adjusting binary tree |17j . 

In the case of maintaining a stable sample for general a, probabilities in the stable sample only 
need to be modified when the ratio Wi/pi of a modified input entry is outside the current (r, r) 
range. In the batch algorithm we worked with the ordered list of ratios Wi/pi. The prefix and suffix 
of this list formed PPS samples, and thus, within the suffix and prefix of the ordered ratios list, 
the ratio order corresponds to the weights order. But in the middle range, for ratios in {t,t), the 
order of ratios may not correspond to the order of weights. 

A critical observation for our analysis is that once two entries are "processed together" in a 
prefix or suffix of the sorted ratios list, and thus the heavier one precedes the lighter one, their 
relative position remains the same in the sorted ratios list, even if they are no longer members of 
the prefix or suffix (have ratios below r or above r). This situation only changes, i.e. Wi > wj 
and Wi/pi < Wj/pj when the weight of at least one of the entries is modified. Intuitively, in our 
analysis, we charge the cost of "correcting" this order when an entry works its way back into a 
prefix or suffix to the modifications in the input. 

The main structure that we use to support this analysis is what we call a stretch, which maintains 
a subset of entries. The stretch supports the following operations in O(logn) time: deletion of an 
entry, splitting a stretch into two stretches that include all entries with weights that are respectively 
below and above some value, merging two stretches with non-overlapping weight ranges, performing 
a lookup of the smallest/largest weight above some value, and returning the weight of all entries 
that are below or above a certain value. Consequently, two stretches with overlapping weight ranges 
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can be merged in time proportional to log(n) times the number of interleaved segments in their 
joint sorted order. 

Instead of treating entries as single entities, they are grouped to stretches. All entries currently 
in the same stretch share the same ratio r = Wi/pi (which we refer to as the threshold of the 
stretch) or else have Wi> t and pi = 1. Wc thus assign a ratio to a stretch as the range that spans 
its threshold and its largest weight. All stretches are maintained in a structure which supports 
insertion, deletion, and accessing stretches with maximum or minimum threshold in O(logn) time. 
This structure corresponds to the "sorted list of ratios" in the batch algorithm. 

When an entry is modified, it is deleted from its current stretch and inserted as a single-entry 
stretch. To recompute the distribution in response to the modification, wc apply an extension of 
our batch algorithm that can manipulate stretches: If the update produces a higher maximum 
ratio or lower minimum ratio and a-stability is violated, we obtain new values for r and r while 
processing stretches by decreasing and increasing ratios, as done (for single entries) by the batch 
algorithm. 

As the prefix (and suffix) of the "sorted" ratios are processed in decreasing ratio order, stretches 
are merged into a single prefix (or a single suffix) stretch. We stop when the difference of squares 
is exactly a. The cost of merging stretches depends on the number of interleaved sorted segments 
of stretches, and each segment can be "charged" to a modification of an item. 
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